3  Results

3.1 Contributing Factors

The investigation focuses on discerning patterns in contributing factors influencing traffic accidents in NYC. Initially, we aim to identify prevalent factors contributing to these incidents. Subsequently, an exploration ensues to ascertain potential associations between these contributing factors and variables such as Visibility, Wind’s Direction, and Wind’s Speed. This analytic approach aims to unveil underlying relationships and enhance our understanding of the multifaceted dynamics that influence traffic accidents in the city. By examining the interplay of contributing factors with weather-related variables, we seek to provide valuable insights that contribute to a comprehensive understanding of the factors influencing road safety in the urban context.

  • Factors Association
Code
library(plotly, quietly = TRUE)
library(rjson, quietly = TRUE)
library(geojsonio, quietly = TRUE)
library(httr, quietly = TRUE)
library(jsonlite, quietly = TRUE)
library(corrplot, quietly = TRUE)
library(ggplot2, quietly = TRUE)
library(reshape2, quietly = TRUE)
library(dplyr, quietly = TRUE)
library(lubridate, quietly = TRUE)
library(redav, quietly = TRUE)
library(ggalluvial, quietly = TRUE)
library(ggridges, quietly = TRUE)
library(tidyr, quietly = TRUE)
library(forcats, quietly = TRUE)
library(tidytext, quietly = TRUE)
library(tidyverse, quietly = TRUE)
data <- read.csv('preprocess/NY_Accidents_March23.csv')
reason <- data[, c(14, 30:42)]
reason[, -1] <- lapply(reason[, -1], as.logical)
cor_matrix <- cor(reason[, 2:14])
ggplot(melt(cor_matrix), aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0) +
  theme_minimal() +
  labs(title = "Association Between Contributing Factors") +
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank())

In the association heat map, two variables stand out as peculiar: “Turning_Loop” and “Roundabout.” Both variables exhibit only one value (False in this case). Consequently, we can safely exclude these two variables from our subsequent analysis, as their singular values do not contribute meaningfully to our research outcomes. This strategic decision allows us to focus on the variables that offer more variation and potential impact on our results.

Code
cor_matrix_new <- cor(reason[, c(2:8, 10:13)])
corrplot(cor_matrix_new, method = "color", addCoef.col = "black")

Moreover, a quantitative analysis through a graph provides clearer insights. Notably, the variable “Crossing” exhibits a strong correlation with “Traffic_Signal,” which is logical given the intricate scenarios at intersections where drivers must adhere to traffic signals to prevent accidents. Additionally, “Traffic_Calming” and “Bump” show some association, as the presence of additional traffic calming measures increases the likelihood of encountering a bump, whether it involves interactions between vehicles or with these barriers. On the contrary, several factors, such as “Give_Away,” “Junction,” and “No_Exit,” lack a discernible association with others.

  • Top-3 Factors Pattern
Code
column_sums <- colSums(reason[, 2:14])
top_3 <- names(sort(column_sums, decreasing = TRUE)[1:3])
top_3_reason <- reason[, c("County", top_3)]

top_3_reason <- top_3_reason %>%
  gather(Factors, Value, -County) %>%
  filter(Value == TRUE) %>%
  select(County, Factors)

top_3_reason_cur <- top_3_reason[1:1000,]
ggplot(top_3_reason_cur, aes(axis1 = County, axis2 = Factors)) +
  geom_alluvium(aes(fill = Factors)) +
  geom_stratum(aes(fill = County)) +
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  theme_minimal() +
  labs(title = "Alluvial Plot: County to Factors")

Given the huge data amount of our dataset, we have opted to visualize only the top-3 contributing factors using the ggalluvium plot. In conjunction with the previously presented graphs, it becomes evident that to mitigate traffic accidents effectively, governmental focus and resources should be directed towards enhancing the safety and maintenance of crossing, junction, and traffic signal infrastructure across all listed areas in NYC. These factors not only significantly contribute to accidents but also exhibit interconnectedness, as seen in the relationship between crossing and traffic signals.

  • Quantitative Factors

We would like to further research on some other quantitative factors, to see if there is some distribution difference in different logic factors.

Code
q_reason <- data[, c(21, 23:25, 27, 30:42)]
column_sums <- colSums(q_reason[, 6:18])
top_9 <- names(sort(column_sums, decreasing = TRUE)[1:9])

vis <- q_reason[, c("Visibility.mi.", top_9)]

long_vis <- gather(vis, key = "Factors", value = "Value", 2:10)

ggplot(long_vis, aes(y = Value, x = Visibility.mi.)) +
  geom_density_ridges() +
  facet_wrap(~ Factors, scales = "free_y") +
  theme_minimal() +
  labs(title = "Ridge graph of Visibility by Factors")

While there may be an initial inclination to attribute accidents to visibility issues, our ridge density graphs reveal that the top nine common factor labels in traffic accidents do not exhibit discernible differences in visibility distribution between True and False cases.

Code
tem <- q_reason[, c("Temperature.F.", top_9)]

long_tem <- gather(tem, key = "Factors", value = "Value", 2:10)

ggplot(long_tem, aes(y = Value, x = Temperature.F.)) +
  geom_density_ridges() +
  facet_wrap(~ Factors, scales = "free_y") +
  theme_minimal() +
  labs(title = "Ridge graph of Temperature by Factors")

However, we observe variations concerning temperature. Specifically, we note that temperatures tend to concentrate in two extremes, indicating that both cold and hot weather conditions may have an impact on road conditions and drivers’ concentration, which lead to a traffic accident associated with this factor label.

3.3 Spatial Variation in Traffic Accident:

  • Temporal Trend for Spatial Data

We now will step into the spatial character for our dataset. However, for a fair comparison, we need to check the temporal trend for the amount of traffic accidents in differen area.

Code
month_counts <- quant_data %>%
  group_by(County, Year) %>%
  summarise(UniqueMonths = n_distinct(Month))

year_data <- quant_data %>%
  group_by(County, Year) %>%
  summarise(Count = n())

avg_data <- year_data %>%
  left_join(month_counts, by = c("County", "Year")) %>%
  mutate(AvgCount = Count / UniqueMonths)

ggplot(avg_data, aes(x = Year, y = AvgCount, fill = County)) +
  geom_bar(stat = "identity", position = "dodge", color = "black") +
  labs(title = "Average Count per Month for Each County",
       x = "Year",
       y = "Average Accident Count per Month") +
  theme_minimal() +
  facet_wrap(~ County, scales = "free_y")

Analyzing the graph, we observe a consistent upward trend in the monthly averaged number of traffic accidents from 2016 to 2022. Particularly noteworthy is a sudden spike in 2022, resulting in a significant increase in traffic accident occurrences and subsequent societal losses. In Queens, a slight decrease is evident in 2018, but attention should be directed to Richmond, where the order of accident numbers deviates from the overall trend, as previously indicated in the graph.

Richmond presents a peculiar pattern from 2017 to 2021, resembling a “U” shape and resulting in a bimodal distribution. This anomaly is tentatively attributed to potential issues with data collection. Despite utilizing monthly averages for each year, there is a notable decrease in 2023. It’s crucial to consider that our data collection only extends until March of this year, and as we’ve established earlier, traffic accidents are less frequent from October to April of the following year.

  • Severity Analysis in Different Area

We would also like to analysis the Severity distribution and their potential cause in 5 different areas.

Code
sever_data <- data[, c(3, 14, 30:42)]
sever_data <- sever_data %>%
  mutate(TrueCount = rowSums(select(., 3:15) == TRUE))

heatmap_data <- sever_data %>%
  group_by(County, Severity, TrueCount) %>%
  summarise(Count = n())
  
ggplot(heatmap_data, aes(x = TrueCount, y = Severity, fill = Count)) +
  geom_tile() +
  scale_fill_viridis_c() +
  facet_wrap(~ County) +
  labs(title = "Heatmap of Severity vs Factors by County",
       x = "Factors Count",
       y = "Severity",
       fill = "Count")

Analyzing the heatmap, we observe a consistent pattern in the distribution of factor labels across different severities. In all areas, the majority of traffic accidents occur with 0-1 factor labels and severity levels 2-3. This finding reinforces the notion that attributing the severity of traffic accidents to specific areas might not be a significant factor. As a result, our primary focus should be on understanding and addressing the spatial density distribution of traffic accidents in NYC.

3.4 Comparative Analysis of Boroughs:

Next, we will examine spatial distribution in accident density across different boroughs in NYC in our dataset (Bronx, Queens, Kings, New York and Richmond).

Code
# Data processing
# fetch GeoJson data
url <- "https://services5.arcgis.com/GfwWNkhOj9bNBqoJ/arcgis/rest/services/NYC_Census_Tracts_for_2020_US_Census/FeatureServer/0/query?where=1=1&outFields=*&outSR=4326&f=pgeojson"
geojson <- rjson::fromJSON(file=url)
loc <- c()
for (i in 1:2325){
  loc[i] = geojson$features[[i]]$properties$GEOID
}
# df <- data.frame(matrix(ncol = 3, nrow = 0))
# colnames(df) <- c("GEOID", "count", "area")
# for (i in 1:2325){
#   df[i, ] = list(loc[i], 0, geojson$features[[i]]$properties$Shape__Area)
# }
# # Convert lat&lng to Geoid and save the accident counts in df
# for (i in seq(1, length(data$ID))) {
#   res = GET("https://geo.fcc.gov/api/census/area",
#       query = list(lat = data[i,]$Start_Lat, lon = data[i,]$Start_Lng, censusYear=2020,format="json"))
#   
#   temp = substr(fromJSON(rawToChar(res$content))$results$block_fips[1],1,11)
#   df[df["GEOID"] == temp,]["count"] = df[df["GEOID"] == temp,]["count"] + 1
# }
# 
# df$density <- df$count / df$area
# write.csv(df,file="preprocess/density.csv")
Code
df <- read.csv("preprocess/density.csv")
g <- list(
  fitbounds = "locations",
  visible = FALSE)
 
 
fig <- plot_ly() %>%  
  add_trace(
  type="choropleth",
  geojson=geojson, #Coordinates
  locations=loc, #Geoid
  z=df$density,
  colorscale="Blues",
  reversescale=TRUE,
  featureidkey="properties.GEOID") %>%
  layout(geo = g)
 
fig

In this plotly interactive graph, we could see the distribution of the traffic density in NYC (within those 5 boroughs). Note there are many other areas in New York City we do not count in, making the graph relatively sparse.